colin raffel
Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Marinas, Ines Altemir, Kucherenko, Anastasiia, Sternfeld, Alexander, Kucharavy, Andrei
The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, despite potentially containing critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging Elasticsearch parallel indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T tokens out of 15.2T used to train the Apertus LLM family, creating both a critical LLM safety tool and effectively an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can be used to ensure previously inaccessible jailbreak-agnostic LLM safety. We hope that our findings will be useful to other teams attempting large-scale data indexing and facilitate the general transition towards greener computation.
What Matters for Model Merging at Scale?
Yadav, Prateek, Vu, Tu, Lai, Jonathan, Chronopoulou, Alexandra, Faruqui, Manaal, Bansal, Mohit, Munkhdalai, Tsendsuren
Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors, such as the base model quality and the number of expert models, to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods (Averaging, Task Arithmetic, DARE, and TIES) across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better compared to the multitask trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.
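As a rough illustration of two of the merging methods named in this abstract, the sketch below implements parameter averaging and task arithmetic on toy weights. It is not the authors' code: the `Weights` representation (a dict of flat float lists), the function names, and the `scale` parameter are assumptions made for this example, and DARE and TIES are not shown.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def average_merge(experts: List[Weights]) -> Weights:
    """Averaging: take the parameter-wise mean of all expert weights."""
    n = len(experts)
    return {
        name: [sum(vals) / n for vals in zip(*(e[name] for e in experts))]
        for name in experts[0]
    }


def task_arithmetic_merge(
    base: Weights, experts: List[Weights], scale: float = 1.0
) -> Weights:
    """Task arithmetic: add the scaled sum of task vectors to the base.

    A task vector is (expert - base); `scale` is an assumed hyperparameter.
    """
    merged: Weights = {}
    for name, base_vals in base.items():
        merged[name] = [
            b + scale * sum(e[name][i] - b for e in experts)
            for i, b in enumerate(base_vals)
        ]
    return merged


# Toy example: one shared base model and two fine-tuned experts.
base = {"w": [0.0, 1.0]}
experts = [{"w": [1.0, 1.0]}, {"w": [0.0, 3.0]}]

avg = average_merge(experts)
ta = task_arithmetic_merge(base, experts, scale=0.5)
print(avg, ta)
```

With `scale` set to one over the number of experts (0.5 here), task arithmetic gives the same result as plain averaging; other values of `scale` move the merged model further from or closer to the base.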
BloombergGPT: A Large Language Model for Finance
Wu, Shijie, Irsoy, Ozan, Lu, Steven, Dabravolski, Vadim, Dredze, Mark, Gehrmann, Sebastian, Kambadur, Prabhanjan, Rosenberg, David, Mann, Gideon
The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in the literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.
Derivative Free Weight-space Ensembling
Recent work suggests that interpolating between the weights of two specialized language models can transfer knowledge between tasks in a way that multi-task learning cannot. However, very few have explored interpolation between more than two models, where each has a distinct knowledge base. In this paper, we introduce Derivative Free Weight-space Ensembling (DFWE), a new few-sample task transfer approach for open-domain dialogue. Our framework creates a set of diverse expert language models trained using a predefined set of source tasks. Next, we finetune each of the expert models on the target task, approaching the target task from several distinct knowledge bases. Finally, we linearly interpolate between the model weights using a gradient-free optimization algorithm to efficiently find a good interpolation weighting. We demonstrate the effectiveness of the method on FETA-Friends, where it outperforms the standard pretrain-finetune approach.
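The final step described in this abstract, searching for interpolation coefficients without gradients, might be sketched as follows. The abstract does not name the optimizer, so plain random search stands in for it here; the flat-list weight representation, the function names, the restriction to non-negative coefficients that sum to one, and the toy squared-error loss are all assumptions made for this example.

```python
import random
from typing import Callable, List, Tuple


def interpolate(experts: List[List[float]], coeffs: List[float]) -> List[float]:
    """Linear combination of expert weight vectors with the given coefficients."""
    dim = len(experts[0])
    return [sum(c * e[i] for c, e in zip(coeffs, experts)) for i in range(dim)]


def random_search(
    experts: List[List[float]],
    loss: Callable[[List[float]], float],
    trials: int = 200,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Gradient-free search: sample coefficient vectors, keep the best one.

    Random search is a stand-in; the abstract does not specify the algorithm.
    """
    rng = random.Random(seed)
    k = len(experts)
    best_coeffs = [1.0 / k] * k  # start from the uniform average
    best_loss = loss(interpolate(experts, best_coeffs))
    for _ in range(trials):
        # Small offset keeps the normalizer strictly positive.
        raw = [rng.random() + 1e-12 for _ in range(k)]
        total = sum(raw)
        coeffs = [r / total for r in raw]  # non-negative, sums to one
        candidate = loss(interpolate(experts, coeffs))
        if candidate < best_loss:
            best_coeffs, best_loss = coeffs, candidate
    return best_coeffs, best_loss


# Toy stand-in for a validation loss: squared distance to a target vector.
target = [0.7, 0.3]


def toy_loss(weights: List[float]) -> float:
    return sum((w - t) ** 2 for w, t in zip(weights, target))


experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_loss = toy_loss(interpolate(experts, [1.0 / 3] * 3))
best_coeffs, best_loss = random_search(experts, toy_loss)
print(best_coeffs, best_loss)
```

In a real setting the loss would come from evaluating the interpolated model on held-out target-task data, which is why only function evaluations, and no gradients, are needed.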
Pop2Piano: Pop Audio-based Piano Cover Generation
Piano covers of pop music are enjoyed by many people. However, the task of automatically generating piano covers of pop music is still understudied. This is partly due to the lack of synchronized {Pop, Piano Cover} data pairs, which made it challenging to apply the latest data-intensive deep learning-based methods. To leverage the power of the data-driven approach, we make a large amount of paired and synchronized {Pop, Piano Cover} data using an automated pipeline. In this paper, we present Pop2Piano, a Transformer network that generates piano covers given waveforms of pop music. To the best of our knowledge, this is the first model to generate a piano cover directly from pop audio without using melody and chord extraction modules. We show that Pop2Piano, trained with our dataset, is capable of producing plausible piano covers.
Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization
Al-Rfooh, Bashar, Abandah, Gheith, Al-Rfou, Rami
Most previous work on learning diacritization of the Arabic language relied on training models from scratch. In this paper, we investigate how to leverage pre-trained language models to learn diacritization. We finetune token-free pre-trained multilingual models (ByT5) to learn to predict and insert missing diacritics in Arabic text, a complex task that requires understanding the sentence semantics and the morphological structure of the tokens. We show that we can achieve state-of-the-art results on the diacritization task with a minimal amount of training and no feature engineering, reducing WER by 40%. We release our finetuned models for the greater benefit of the researchers in the community.
SPT: Semi-Parametric Prompt Tuning for Multitask Prompted Learning
Bari, M Saiful, Zhang, Aston, Zheng, Shuai, Shi, Xingjian, Zhu, Yi, Joty, Shafiq, Li, Mu
Pre-trained large language models can efficiently interpolate human-written prompts in a natural way. Multitask prompted learning can help generalization through a diverse set of tasks at once, thus enhancing the potential for more effective downstream fine-tuning. To perform efficient multitask-inference in the same batch, parameter-efficient fine-tuning methods such as prompt tuning have been proposed. However, the existing prompt tuning methods may lack generalization. We propose SPT, a semi-parametric prompt tuning method for multitask prompted learning. The novel component of SPT is a memory bank from which memory prompts are retrieved based on discrete prompts. Extensive experiments, such as (i) fine-tuning a full language model with SPT on 31 different tasks from 8 different domains and evaluating zero-shot generalization on 9 held-out datasets under 5 NLP task categories and (ii) pretraining SPT on the GLUE datasets and evaluating fine-tuning on the SuperGLUE datasets, demonstrate the effectiveness of SPT.
Happy AI New Year! Global Researchers Reflect on 2019, Talk Trends for 2020
The year 2019 saw unprecedented growth in AI research, development and deployment. Great technical progress has been achieved in image recognition, image generation, natural language understanding and other fields, while challenges remain with data management, efficiency measurement, computational capacity and other issues. To welcome 2020 with some fresh AI perspectives, Synced spoke with global researchers from Google Brain, Sony AI, Alibaba affiliate Ant Financial (formerly known as Alipay), Israel-based AI processor company Habana (recently acquired by Intel), Russian tech giant Yandex, Vietnam's newly established research lab VinAI Research, French deep learning inference acceleration startup Mipsology, and China-based remote sensing data platform TerraQuanta. Colin Raffel, Senior Research Scientist, Google Brain: In 2019 the community made huge progress on learning from limited labels. MixMatch, UDA, S4L, and ReMixMatch produced huge gains on standard semi-supervised learning benchmarks.
Colin Raffel - Doing Strange Things with Attention - AI With The Best October 14-15, 2017
AI With The Best hosted 50 speakers and hundreds of attendees from all over the world on a single platform on October 14-15, 2017. The platform held live talks, Insights/Questions pages, and bookings for 1-on-1s with speakers. Colin is a Research Scientist (formerly a resident) at Google Brain, where he is working on unsupervised learning, machine learning security, and models for sequential data. He did his PhD at Columbia University in LabROSA, supervised by Dan Ellis. He also has a Master's from Stanford University's CCRMA and a Bachelor's from Oberlin College.